class: center, middle, inverse, title-slide # Data Visualization for Economic Historians ### Jonathan Jayes ### LEAP Stellenbosch ### 2021/05/28 (updated: 2021-05-27) --- class: inverse, center, bottom background-image: url(images/LEAP_logo.png) ## Follow along with the slides - URL in Chat --- # Purpose .pull-left[ ### 1. Get you excited about data viz. ### 2. Show some tips and tricks that make your figures pop irrespective of software used ### 3. Evangelize R ] .pull-right[  .footnote[ Photo from [Hans Rosling's TED Talk](https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen) ] ] --- # Structure .pull-left[ ### 1. Exploratory Data Analysis ### 2. Why Data Viz? #### Break ### 3. Polishing Figures for Publication ### 4. Interactivity ### 5. Recreating Published Figures ] .pull-right[  ] --- class: center, top # Everything is a story ### /ˈstɔːri/  --- class: center, top  ###[Dan Harmon's Story circle](https://youtu.be/UdxX_Kljrq8) --- class: center, top ## Our Own Story Circle ### Figure here --- # 1. The zone of comfort Our stomping ground **MS Excel**:  --- # 1. The zone of comfort **MS Excel** has useful features that suggest how to visualize your data  --- # Why leave? .center[  ] --- class: inverse, center, middle # 2. You want something --- # You want to learn about your data by .panelset[ .panel[.panel-name[Counting things] ```r df <- read_rds("data/baptism_data.rds") df %>% count(female_profession, sort = T) ``` ``` ## # A tibble: 148 x 2 ## female_profession n ## <chr> <int> ## 1 <NA> 595 ## 2 Domestic duties 228 ## 3 Machinist 154 ## 4 Domestic service 46 ## 5 Domestic servant 41 ## 6 Household duties 41 ## 7 Clerk 30 ## 8 School Teacher 26 ## 9 Factory worker 24 ## 10 Shorthand Typist 21 ## # ... with 138 more rows ``` ] .panel[.panel-name[Drawing maps] <img src="index_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> ] .panel[.panel-name[Making plots] <!-- --> ] ] --- # You want to make beautiful charts with .panelset[ .panel[.panel-name[Informative colours]
] .panel[.panel-name[Interactive elements]
] .panel[.panel-name[Engaging animations] <center> <iframe width="500" height="500" src="https://www.youtube.com/embed/D_Aakjiadm4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> ] ] --- class: inverse, center, middle # 3. You enter an unfamiliar situation --- ## Where do we start? .pull-left[ - We will use some data that Laura Richardson has analysed. - It contains baptism records from seven Anglican churches in Cape Town between 1910 and 1960. - [Here is a link](https://scholar.sun.ac.za/handle/10019.1/109081) to her Master's thesis, an exploration of "**prenuptial pregnancy and unmarried motherhood** amongst a select group of South Africans living in Cape Town" in the first half of the twentieth century. ] .pull-right[  ] --- # What does the data look like?  --- # What does the data look like? - It has one row per child who is baptised (one row per observation) - Every column is a variable - It is a wide dataset --- # What does the data look like in R? ```r df <- read_rds("data/baptism_data.rds") head(df) ``` ``` ## # A tibble: 6 x 31 ## parish_name year_baptized month_baptized day_baptized christian_name surname_b ## <chr> <dbl> <ord> <chr> <chr> <chr> ## 1 Christ Chu~ 1921 Apr 8 Islay Deirdre Molteno ## 2 St Mark's,~ 1939 May 14 William Isaac Fisher ## 3 St Peter's~ 1914 Feb 21 Douglas Walte~ Nash ## 4 Christ Chu~ 1916 Nov 12 Gordon Rhodes Dowthwai~ ## 5 St James t~ 1914 Feb 8 Ronald Overton Beevers ## 6 St James t~ 1915 Dec 22 Hugh Andrew Starke ## # ... with 25 more variables: year_of_birth <chr>, month_of_birth <chr>, ## # day_of_birth <chr>, race <fct>, year_married <dbl>, month_married <dbl>, ## # day_married <chr>, female_age <dbl>, female_surname <chr>, ## # female_name <chr>, female_country <chr>, female_profession <chr>, ## # male_age <dbl>, male_surname <chr>, male_name <chr>, male_country <chr>, ## # male_profession <chr>, parish_address <chr>, lat <dbl>, long <dbl>, ## # birth_date <date>, baptism_date <date>, married_date <date>, ## # days_marriage_to_birth <dbl>, prenup_preg <chr> ``` --- # How can we quickly understand what it is made up of? ```r skimr::skim(df) ``` Table: Data summary | | | |:------------------------|:----| |Name |df | |Number of rows |1604 | |Number of columns |31 | |_______________________ | | |Column type frequency: | | |character |18 | |Date |3 | |factor |2 | |numeric |8 | |________________________ | | |Group variables |None | **Variable type: character** |skim_variable | n_missing| complete_rate| min| max| empty| n_unique| whitespace| |:-----------------|---------:|-------------:|---:|---:|-----:|--------:|----------:| |parish_name | 0| 1.00| 18| 29| 0| 7| 0| |day_baptized | 1| 1.00| 1| 2| 0| 31| 0| |christian_name | 0| 1.00| 4| 32| 0| 1541| 0| |surname_b | 0| 1.00| 3| 17| 0| 948| 0| |year_of_birth | 0| 1.00| 4| 4| 0| 50| 0| |month_of_birth | 0| 1.00| 1| 2| 0| 12| 0| |day_of_birth | 0| 1.00| 1| 2| 0| 31| 0| |day_married | 0| 1.00| 1| 2| 0| 31| 0| |female_surname | 0| 1.00| 3| 34| 0| 917| 0| |female_name | 0| 1.00| 3| 30| 0| 1206| 0| |female_country | 366| 0.77| 5| 19| 0| 18| 0| |female_profession | 595| 0.63| 2| 42| 0| 147| 0| |male_surname | 0| 1.00| 3| 16| 0| 908| 0| |male_name | 0| 1.00| 3| 27| 0| 1125| 0| |male_country | 358| 0.78| 5| 17| 0| 20| 0| |male_profession | 6| 1.00| 4| 60| 0| 513| 0| |parish_address | 0| 1.00| 43| 58| 0| 7| 0| |prenup_preg | 0| 1.00| 10| 12| 0| 2| 0| **Variable type: Date** |skim_variable | n_missing| complete_rate|min |max |median | n_unique| |:-------------|---------:|-------------:|:----------|:----------|:----------|--------:| |birth_date | 0| 1.00|1910-03-27 |1959-12-30 |1943-08-03 | 1518| |baptism_date | 76| 0.95|1910-07-29 |1960-05-08 |1944-07-26 | 1040| |married_date | 0| 1.00|1909-09-03 |1959-04-11 |1941-08-02 | 1301| **Variable type: factor** |skim_variable | n_missing| complete_rate|ordered | n_unique|top_counts | |:--------------|---------:|-------------:|:-------|--------:|:--------------------------------------| |month_baptized | 74| 0.95|TRUE | 12|Feb: 157, Dec: 153, Nov: 150, Mar: 143 | |race | 0| 1.00|FALSE | 3|Col: 1096, Whi: 457, Bla: 51 | **Variable type: numeric** |skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist | |:----------------------|---------:|-------------:|-------:|------:|-------:|-------:|-------:|-------:|-------:|:----------------------------------------| |year_baptized | 1| 1| 1940.64| 12.81| 1910.00| 1931.00| 1943.00| 1951.00| 1960.00|▃▃▅▇▇ | |year_married | 0| 1| 1938.69| 12.94| 1909.00| 1929.00| 1941.00| 1950.00| 1959.00|▃▅▆▇▇ | |month_married | 0| 1| 6.92| 3.58| 1.00| 4.00| 7.00| 10.00| 12.00|▅▅▅▃▇ | |female_age | 3| 1| 22.60| 4.32| 16.00| 20.00| 22.00| 25.00| 55.00|▇▃▁▁▁ | |male_age | 2| 1| 25.67| 5.49| 17.00| 22.00| 24.00| 28.00| 60.00|▇▅▁▁▁ | |lat | 0| 1| -33.96| 0.03| -34.00| -33.97| -33.97| -33.95| -33.90|▂▇▃▁▂ | |long | 0| 1| 18.49| 0.03| 18.39| 18.47| 18.50| 18.50| 18.59|▁▂▇▁▁ | |days_marriage_to_birth | 0| 1| 581.28| 639.61| 1.00| 150.00| 337.00| 753.00| 3615.00|▇▂▁▁▁ | --- # What can we see? .pull-left[ ```r skimr::skim(df) ``` Table: Data summary | | | |:------------------------|:----| |Name |df | |Number of rows |1604 | |Number of columns |31 | |_______________________ | | |Column type frequency: | | |character |18 | |Date |3 | |factor |2 | |numeric |8 | |________________________ | | |Group variables |None | **Variable type: character** |skim_variable | n_missing| complete_rate| min| max| empty| n_unique| whitespace| |:-----------------|---------:|-------------:|---:|---:|-----:|--------:|----------:| |parish_name | 0| 1.00| 18| 29| 0| 7| 0| |day_baptized | 1| 1.00| 1| 2| 0| 31| 0| |christian_name | 0| 1.00| 4| 32| 0| 1541| 0| |surname_b | 0| 1.00| 3| 17| 0| 948| 0| |year_of_birth | 0| 1.00| 4| 4| 0| 50| 0| |month_of_birth | 0| 1.00| 1| 2| 0| 12| 0| |day_of_birth | 0| 1.00| 1| 2| 0| 31| 0| |day_married | 0| 1.00| 1| 2| 0| 31| 0| |female_surname | 0| 1.00| 3| 34| 0| 917| 0| |female_name | 0| 1.00| 3| 30| 0| 1206| 0| |female_country | 366| 0.77| 5| 19| 0| 18| 0| |female_profession | 595| 0.63| 2| 42| 0| 147| 0| |male_surname | 0| 1.00| 3| 16| 0| 908| 0| |male_name | 0| 1.00| 3| 27| 0| 1125| 0| |male_country | 358| 0.78| 5| 17| 0| 20| 0| |male_profession | 6| 1.00| 4| 60| 0| 513| 0| |parish_address | 0| 1.00| 43| 58| 0| 7| 0| |prenup_preg | 0| 1.00| 10| 12| 0| 2| 0| **Variable type: Date** |skim_variable | n_missing| complete_rate|min |max |median | n_unique| |:-------------|---------:|-------------:|:----------|:----------|:----------|--------:| |birth_date | 0| 1.00|1910-03-27 |1959-12-30 |1943-08-03 | 1518| |baptism_date | 76| 0.95|1910-07-29 |1960-05-08 |1944-07-26 | 1040| |married_date | 0| 1.00|1909-09-03 |1959-04-11 |1941-08-02 | 1301| **Variable type: factor** |skim_variable | n_missing| complete_rate|ordered | n_unique|top_counts | |:--------------|---------:|-------------:|:-------|--------:|:--------------------------------------| |month_baptized | 74| 0.95|TRUE | 12|Feb: 157, Dec: 153, Nov: 150, Mar: 143 | |race | 0| 1.00|FALSE | 3|Col: 1096, Whi: 457, Bla: 51 | **Variable type: numeric** |skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist | |:----------------------|---------:|-------------:|-------:|------:|-------:|-------:|-------:|-------:|-------:|:----------------------------------------| |year_baptized | 1| 1| 1940.64| 12.81| 1910.00| 1931.00| 1943.00| 1951.00| 1960.00|▃▃▅▇▇ | |year_married | 0| 1| 1938.69| 12.94| 1909.00| 1929.00| 1941.00| 1950.00| 1959.00|▃▅▆▇▇ | |month_married | 0| 1| 6.92| 3.58| 1.00| 4.00| 7.00| 10.00| 12.00|▅▅▅▃▇ | |female_age | 3| 1| 22.60| 4.32| 16.00| 20.00| 22.00| 25.00| 55.00|▇▃▁▁▁ | |male_age | 2| 1| 25.67| 5.49| 17.00| 22.00| 24.00| 28.00| 60.00|▇▅▁▁▁ | |lat | 0| 1| -33.96| 0.03| -34.00| -33.97| -33.97| -33.95| -33.90|▂▇▃▁▂ | |long | 0| 1| 18.49| 0.03| 18.39| 18.47| 18.50| 18.50| 18.59|▁▂▇▁▁ | |days_marriage_to_birth | 0| 1| 581.28| 639.61| 1.00| 150.00| 337.00| 753.00| 3615.00|▇▂▁▁▁ | ] .pull-right[ - Number of baptisms (1604) - What data is missing? - For `female_profession`, `female_country` and `male_country` we are missing data for more than 20 percent of our observations ] --- # How many children are baptized in each church? ```r df %>% count(parish_name, sort = TRUE) ``` ``` ## # A tibble: 7 x 2 ## parish_name n ## <chr> <int> ## 1 St Mark's, Athlone 799 ## 2 St Peter's, Mowbray 307 ## 3 Christ Church, Kenilworth 211 ## 4 St James the Great, Sea Point 83 ## 5 St Anne's, Maitland 82 ## 6 St Margaret's, Parow 72 ## 7 St Cyprian's, Langa 50 ``` --- # When are these children baptized? ```r df %>% group_by(year_baptized, parish_name) %>% summarise(number_of_baptisms = n()) %>% ungroup() ``` ``` ## # A tibble: 167 x 3 ## year_baptized parish_name number_of_baptisms ## <dbl> <chr> <int> ## 1 1910 St James the Great, Sea Point 3 ## 2 1910 St Peter's, Mowbray 3 ## 3 1911 Christ Church, Kenilworth 2 ## 4 1911 St James the Great, Sea Point 9 ## 5 1912 St James the Great, Sea Point 5 ## 6 1912 St Peter's, Mowbray 4 ## 7 1913 Christ Church, Kenilworth 2 ## 8 1913 St James the Great, Sea Point 13 ## 9 1913 St Peter's, Mowbray 6 ## 10 1914 Christ Church, Kenilworth 1 ## # ... with 157 more rows ``` --- # How can we vizualize this? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% group_by(year_baptized, parish_name) %>% mutate(number_of_baptisms = n()) %>% ungroup() %>% ggplot(aes(x = year_baptized, y = number_of_baptisms, colour = parish_name)) + geom_line() + scale_color_brewer(palette = "Dark2") + labs(x = NULL, y = "Number of baptisms per year", colour = "Parish name") ``` ] ] --- # What do we like and dislike? .pull-left[ <!-- --> ] .pull-right[ ### Improvements: ## .can-edit[idea] ## .can-edit[idea] ] --- # How can we make it clearer? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% group_by(year_baptized, parish_name) %>% mutate(number_of_baptisms = n()) %>% ungroup() %>% ggplot(aes(x = year_baptized, y = number_of_baptisms, colour = parish_name)) + geom_line() + scale_color_brewer(palette = "Dark2") + labs(x = NULL, y = "Number of baptisms per year", colour = "Parish name") + * facet_wrap(~ parish_name) + * theme(legend.position = "none") ``` ] ] --- # Takeaways .pull-left[ <!-- --> ] .pull-right[ - **Christ Church, Kenilworth** and **St Peter's, Mowbray** have baptism records for the entire period - Others have only periodic data - The bulk of our data comes from **St Mark's, Athlone**, and begins later in the series - We must think about these factors when examining time trends ] --- class: inverse, center, middle # 4. You adapt to it --- # Our question: -- ### What characterises the couples who conceieved a child before marriage? --- ## What are the types of pregnancies? .pull-left[ ### Conventional pregnancies - Child is conceived after marriage ### Prenuptial pregnancy - Child is conceived before marriage ] .pull-right[  ] --- ## How can we tell what type of pregnancy we have for each record? -- ### We have the date of birth and date of marriage -- ### We can do some subtraction -- <img src="images/baby.jpg" width="80%" style="display: block; margin: auto;" /> --- ## What's the plan? -- ### Get the date that the parents were married -- ### Get the date that the child was born -- ### If there is fewer than 280 days between marriage and birth, we know the child was most likely a prenuptual pregnancy. --- ### Dealing with dates Dates are difficult: - There are three parts to the [rule determining a leap year](https://r4ds.had.co.nz/dates-and-times.html) - There are many formats to write a date in (e.g. dmy vs mdy) The best resource for dealing with dates in R is in [Chapter 16](https://r4ds.had.co.nz/dates-and-times.html) of [R for Data Science](https://r4ds.had.co.nz/index.html) entitled **Dates and times**. ### We will explore a brief example with our baptism data. --- ## When are the children born? ### The months are recorded differently! .pull-left[ ```r df %>% count(month_baptized, sort = T) %>% head() ``` ``` ## # A tibble: 6 x 2 ## month_baptized n ## <ord> <int> ## 1 Feb 157 ## 2 Dec 153 ## 3 Nov 150 ## 4 Mar 143 ## 5 Apr 126 ## 6 Aug 123 ``` ] .pull-right[ ```r df %>% count(month_of_birth, sort = T) %>% head() ``` ``` ## # A tibble: 6 x 2 ## month_of_birth n ## <chr> <int> ## 1 9 159 ## 2 10 145 ## 3 11 145 ## 4 12 141 ## 5 1 136 ## 6 8 135 ``` ] .footnote[ Note that the months are recorded as characters not numbers, a result of importing them from Excel. ] --- ### When are the children born? ```r library(glue) library(lubridate) df <- df %>% # this glues together three individual variables into # one new variable called `birth_date` mutate(birth_date = glue("{day_of_birth}-{month_of_birth}-{year_of_birth}"), # dmy here means read the date as "day-month-year" birth_date = dmy(birth_date)) df %>% select(christian_name, birth_date) %>% head() ``` ``` ## # A tibble: 6 x 2 ## christian_name birth_date ## <chr> <date> ## 1 Islay Deirdre 1921-02-07 ## 2 William Isaac 1939-02-20 ## 3 Douglas Walter Ellis 1914-01-22 ## 4 Gordon Rhodes 1916-10-22 ## 5 Ronald Overton 1914-01-13 ## 6 Hugh Andrew 1915-11-09 ``` --- ## When are the children baptized? ```r df <- df %>% mutate(baptism_date = glue("{day_baptized}-{month_baptized}-{year_baptized}"), baptism_date = dmy(baptism_date)) ``` ``` ## Warning: 76 failed to parse. ``` ```r df %>% select(christian_name, birth_date, baptism_date) %>% head() ``` ``` ## # A tibble: 6 x 3 ## christian_name birth_date baptism_date ## <chr> <date> <date> ## 1 Islay Deirdre 1921-02-07 1921-04-08 ## 2 William Isaac 1939-02-20 1939-05-14 ## 3 Douglas Walter Ellis 1914-01-22 1914-02-21 ## 4 Gordon Rhodes 1916-10-22 1916-11-12 ## 5 Ronald Overton 1914-01-13 1914-02-08 ## 6 Hugh Andrew 1915-11-09 1915-12-22 ``` --- ### Why did 76 dates fail? ```r df %>% filter(is.na(baptism_date)) %>% select(year_baptized:day_baptized) ``` ``` ## # A tibble: 76 x 3 ## year_baptized month_baptized day_baptized ## <dbl> <ord> <chr> ## 1 1937 Jun <NA> ## 2 NA Sep 30 ## 3 1952 <NA> 14 ## 4 1924 <NA> 5 ## 5 1921 <NA> 18 ## 6 1927 <NA> 4 ## 7 1922 <NA> 5 ## 8 1920 <NA> 24 ## 9 1922 <NA> 1 ## 10 1923 <NA> 5 ## # ... with 66 more rows ``` We see a number of NAs ### Data quality issues are important to catch! --- ## When are the parents married? ```r df <- df %>% mutate(married_date = glue("{day_married}-{month_married}-{year_married}"), married_date = dmy(married_date)) df %>% select(christian_name, birth_date, married_date) %>% head() ``` ``` ## # A tibble: 6 x 3 ## christian_name birth_date married_date ## <chr> <date> <date> ## 1 Islay Deirdre 1921-02-07 1916-12-15 ## 2 William Isaac 1939-02-20 1938-11-12 ## 3 Douglas Walter Ellis 1914-01-22 1912-12-18 ## 4 Gordon Rhodes 1916-10-22 1914-08-18 ## 5 Ronald Overton 1914-01-13 1912-04-24 ## 6 Hugh Andrew 1915-11-09 1915-02-17 ``` --- ### How many prenuptial pregnancies do we have? ```r df <- df %>% mutate(days_marriage_to_birth = as.numeric(as.duration(birth_date - married_date), "days"), # if gap between parents' marriage and child's birth # is less than 280 days we classify it as a prenup_preg prenup_preg = ifelse(days_marriage_to_birth < 280, "Prenuptual", "Conventional")) df %>% count(prenup_preg, sort = T) ``` ``` ## # A tibble: 2 x 2 ## prenup_preg n ## <chr> <int> ## 1 Conventional 949 ## 2 Prenuptual 655 ``` --- ### What does the distribution of pregnancies look like by parish? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% ggplot(aes(days_marriage_to_birth, fill = parish_name)) + geom_density(show.legend = F) + geom_vline(xintercept = 280, lty = 2) + facet_wrap(~ parish_name) + scale_fill_brewer(palette = "Dark2") + scale_x_continuous(limits = c(0, 1000)) + theme(axis.text.y = element_blank()) + labs(x = "Days from parents' marriage to birth of child", y = NULL, caption = "Note: dotted line indicates 280 days") ``` ] ] --- ### What does the distribution of pregnancies look like over time? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r library(ggridges) df %>% mutate(decade = year_baptized - year_baptized %% 10, decade = as.factor(str_c(decade, "s"))) %>% select(days_marriage_to_birth, decade) %>% filter(!is.na(decade)) %>% ggplot(aes(days_marriage_to_birth, y = decade, fill = decade)) + geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01, gradient_lwd = 1., show.legend = F) + geom_vline(xintercept = 280, lty = 2) + scale_fill_brewer(palette = "Dark2") + scale_x_continuous(limits = c(0, 1000)) + labs(x = "Days from parents' marriage to birth of child", y = NULL, caption = "Note: dotted line indicates 280 days") ``` ] ] --- ## Is our story clear? .pull-left[ <!-- --> ] .pull-right[ ### What can we see ## .can-edit[idea] ### What can we not see? ## .can-edit[idea] ] --- ### How many pregnancies of each type do we have over time? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(decade = year_baptized - year_baptized %% 10, decade = as.factor(str_c(decade, "s"))) %>% select(prenup_preg, decade) %>% filter(!is.na(decade)) %>% group_by(decade, prenup_preg) %>% summarise(n_obs = n()) %>% ungroup() %>% ggplot(aes(y = decade, x = n_obs, fill = as.factor(prenup_preg))) + geom_col(position = "fill", show.legend = F) + scale_x_continuous(labels = percent_format()) + scale_fill_manual(values = c("#af8dc3", "#7fbf7b")) + labs(y = NULL, x = NULL, title = "Share of <span style='color:#7fbf7b'>Prenuptual</span> and <span style='color:#af8dc3'>Conventional</span> Pregnancies by decade") + theme(plot.title = element_markdown()) ``` ] ] --- ## What are the ages of the parents? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% ggplot(aes(x = male_age, y = female_age, colour = prenup_preg)) + geom_abline() + geom_jitter(alpha = .4, show.legend = F) + geom_smooth(se = F, show.legend = F) + scale_colour_manual(values = c("#af8dc3", "#7fbf7b")) + labs(x = "Age of husband", y = "Age of wife", colour = "Type of pregnancy", title = "Partner's age comparison between <span style='color:#7fbf7b'>Prenuptual</span> and<br><span style='color:#af8dc3'>Conventional</span> pregnancies") + theme(plot.title = element_markdown()) ``` ] ] --- ## Do ages differ by type of pregnancy? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r library(patchwork) # This allows us to paste the two plots below one another a <- df %>% ggplot(aes(male_age, fill = prenup_preg)) + geom_density(alpha = .5) + scale_fill_brewer(palette = "Pastel2") + scale_x_continuous(limits = c(NA, 40)) + labs(x = "Husband's age", y = NULL, fill = "Type of pregnancy") b <- df %>% ggplot(aes(female_age, fill = prenup_preg)) + geom_density(alpha = .5) + scale_fill_brewer(palette = "Pastel2") + scale_x_continuous(limits = c(NA, 40)) + labs(x = "Wife's age", y = NULL, fill = "Type of pregnancy") patchwork <- (a / b) patchwork + plot_annotation(title = "More younger partners have prenuptual pregnancies than old:") + theme(plot.title = element_markdown()) ``` ] ] --- ## Is the age gap between partners correlated with prenuptual pregnancy? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(age_gap = male_age - female_age) %>% ggplot(aes(age_gap, fill = prenup_preg)) + geom_density(alpha = .5) + geom_vline(xintercept = 0, lty = 2) + theme(legend.position = "bottom") + labs(x = "Age gap in years\n(husband's age minus wife's age)", y = NULL, title = "There appears to be little difference in partners' age gaps\nbetween conventional and prenuptual pregnancies", fill = "Type of pregnancy", caption = "Note: dotted line represents partners of equal age") ``` ] ] --- --- ### Dealing with categories #### Often we will have many different categories - For example, in our dataset we have 147 different occupations for women ```r df %>% filter(!is.na(female_profession)) %>% count(female_profession, sort = T) ``` ``` ## # A tibble: 147 x 2 ## female_profession n ## <chr> <int> ## 1 Domestic duties 228 ## 2 Machinist 154 ## 3 Domestic service 46 ## 4 Domestic servant 41 ## 5 Household duties 41 ## 6 Clerk 30 ## 7 School Teacher 26 ## 8 Factory worker 24 ## 9 Shorthand Typist 21 ## 10 Shop assistant 20 ## # ... with 137 more rows ``` --- ### It's not possible to put them all on one chart .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% filter(!is.na(female_profession)) %>% count(female_profession, sort = T) %>% mutate(female_profession = fct_reorder(female_profession, n)) %>% ggplot(aes(x = n, y = female_profession)) + geom_col() + labs(x = "Number of observations", y = NULL) ``` ] ] --- # What do we like and dislike? .pull-left[ <!-- --> ] .pull-right[ ### Improvements: ## .can-edit[idea] ## .can-edit[idea] ] --- ### Dealing with categories We can `filter` for only professions with 6 or more observations. .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% filter(!is.na(female_profession)) %>% count(female_profession, sort = T) %>% mutate(female_profession = fct_reorder(female_profession, n)) %>% * filter(n > 6) %>% ggplot(aes(x = n, y = female_profession, * fill = str_detect(female_profession, "Domestic"))) + geom_col() + labs(x = "Number of observations", y = NULL, fill = "Contains 'Domestic'", title = "Most common female professions", * caption = "Limited to professions with more than 5 observations") + scale_fill_brewer(palette = "Dark2") + * theme(legend.position = "right") ``` ] ] --- ### Dealing with categories Or we can choose 12 categories and lump with `fct_lump` the remaining categories into "Other". .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r library(tidytext) df %>% filter(!is.na(female_profession)) %>% * mutate(female_profession = fct_lump(female_profession, 12)) %>% count(female_profession, sort = T) %>% mutate(female_profession = fct_reorder(female_profession, n), female_profession = fct_relevel(female_profession, "Other")) %>% ggplot(aes(x = n, y = female_profession, * fill = str_detect(female_profession, "Domestic"))) + geom_col() + labs(x = "Number of observations", y = NULL, fill = "Contains 'Domestic'", title = "Most common female professions", * caption = "Limited to professions with more than 5 observations") + scale_fill_brewer(palette = "Dark2") + * theme(legend.position = "right") ``` ] ] --- ### Dealing with categories Or we can `group_by` more categories, for example, race. .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r library(tidytext) df %>% * filter(!is.na(female_profession)) %>% group_by(race) %>% count(female_profession, sort = T) %>% slice_max(n, n = 6, with_ties = F) %>% ungroup() %>% * mutate(female_profession = reorder_within(female_profession, n, race)) %>% ggplot(aes(x = n, y = female_profession, * fill = race)) + geom_col() + * scale_y_reordered() + * facet_wrap(~ race, scales = "free") + labs(x = "Number of observations", y = NULL, fill = "Domestic", * title = "Five most common female professions by race") + scale_fill_brewer(palette = "Spectral") + theme(legend.position = "none") ``` ] ] --- ## Where are the churches? ```r df %>% distinct(parish_address) ``` ``` ## # A tibble: 7 x 1 ## parish_address ## <chr> ## 1 16 Summerley Rd, Kenilworth, Cape Town, 7708, South Africa ## 2 30 Bamford Ave, Athlone, Cape Town, 7760, South Africa ## 3 Victoria Rd, Mowbray, Cape Town, 7700, South Africa ## 4 St James Rd, Sea Point, Cape Town, 8005, South Africa ## 5 2B Papu St, Langa, Cape Town, 7456, South Africa ## 6 25 Jansen St, Cape Town, 7500, South Africa ## 7 3 Fontana St, Brooklyn, Cape Town, 7405, South Africa ``` ## But how can we plot these addresses on a map? --- ## Using the `tidygeocoder` package to get coordinates: ```r library(tidygeocoder) locations <- df %>% select(parish_address) %>% # `distinct` means take only one of each address distinct() %>% tidygeocoder::geocode(address = parish_address, method = "osm") # Here "osm" stands for Open Street Map # The package also supports Google Maps, amongst others ``` [Here is a link](https://github.com/jessecambon/tidygeocoder) to the `tidygeocoder` package. --- ## Using the `tidygeocoder` package to get coordinates: This is what we get back: ```r locations ``` ``` ## # A tibble: 7 x 3 ## parish_address lat long ## <chr> <dbl> <dbl> ## 1 16 Summerley Rd, Kenilworth, Cape Town, 7708, South Africa -34.0 18.5 ## 2 30 Bamford Ave, Athlone, Cape Town, 7760, South Africa -34.0 18.5 ## 3 Victoria Rd, Mowbray, Cape Town, 7700, South Africa -33.9 18.5 ## 4 St James Rd, Sea Point, Cape Town, 8005, South Africa -33.9 18.4 ## 5 2B Papu St, Langa, Cape Town, 7456, South Africa -33.9 18.5 ## 6 25 Jansen St, Cape Town, 7500, South Africa -33.9 18.6 ## 7 3 Fontana St, Brooklyn, Cape Town, 7405, South Africa -33.9 18.5 ``` --- # Where are these points? ```r df %>% mutate(parish_address = str_replace_all(parish_address, ",", ",\n")) %>% distinct(lat, long, parish_address) %>% ggplot(aes(x = long, y = lat, # tell `geom_text` to display the address as text label = parish_address)) + geom_point() + geom_text() ``` <!-- --> --- --- ### Plotting results .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(prenup_preg = as.numeric(as.factor(prenup_preg))) %>% lm(prenup_preg ~ female_age + male_age + race + parish_name + year_baptized, data = .) %>% tidy(conf.int = T) %>% filter(term != "(Intercept)") %>% mutate(type = case_when( str_detect(term, "parish_name") ~ "Parish FE", str_detect(term, "race") ~ "Race", str_detect(term, "age") ~ "Age", TRUE ~ "Time trend" )) %>% mutate(term = str_remove_all(term, "race|parish_name"), term = str_replace_all(term, "_", " "), term = str_to_title(term)) %>% ggplot(aes(estimate, y = term, colour = type)) + geom_point() + geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term)) + geom_vline(xintercept = 0, lty = 2) +labs(x = NULL, y = NULL, colour = "Var type") ``` ] ] --- class: inverse, center, middle # 5. You get what you wanted --- # What charts to use -- ## Choosing charts for exploratory data analysis isn't hard -- ### Stick to the winners: - line charts - scatter plots - column charts - density plots -- ### Sometimes nice: - small multiples - simple maps --- class: inverse, center, middle # 6. You pay a heavy price --- <iframe width="560" height="315" src="https://www.youtube.com/embed/5Zg-C8AAIGg?controls=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --- ## Polishing figures for publication -- ### Maps ### Time on a third axis ### Giving context with colour ### Interactivity ### Reproducing figures for publication --- # Making maps Contextual maps. Chloropleths. Animated maps. --- # Coffee rating map .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df <- read_rds("data/coffee_data.rds") ggplot(data = df) + borders() + geom_sf(aes(fill = mean_country_rating, geometry = geometry)) + scale_fill_viridis_c(trans = "sqrt") + theme_void() + theme(legend.position = "bottom") + coord_sf(ylim = c(-50, 80)) + facet_wrap(~ species, nrow = 2) + labs(title = "Average Coffee Rating", subtitle = "By country and species", fill = "Average Rating / 100", caption = "Data: James LeDoux - https://github.com/jldbc/coffee-quality-database", x = "", y = "") ``` ] ] --- ## What are the problems with this map? -- - Lots of missing data -- - Only some countries produce coffee, very few produce Robusta beans -- - **Crucially, we cannot see the countries with small land area** --- ### What shall we do about this? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r library(tidytext) df %>% select(-geometry) %>% mutate(country_name = countrycode::countrycode(iso_a3, "iso3c", "country.name")) %>% mutate(country_name = reorder_within(country_name, mean_country_rating, species)) %>% group_by(species) %>% slice_max(mean_country_rating, n = 10) %>% ungroup() %>% ggplot(aes(x = mean_country_rating, y = country_name, fill = region_un)) + geom_col() + facet_wrap(~ species, scales = "free") + scale_y_reordered() + coord_cartesian(xlim = c(70, 90)) + labs(x = "Mean coffee rating", y = NULL, fill = "Region") ``` ] ] --- ## The trusty column chart gets us some of the way there -- - We still have some information about the geography of the countries that produce the best coffee -- - The chart is very legible and the comparisons are clear -- - But, we have lost the effect of latitude that we saw in Africa. --- ## We may care about spatial autocorrelation We see in a zoomed in section that as we move North and towards the Ethiopian highlands, the coffee rating improves.
--- class: inverse, center, middle # How can we show this on a map? ## But still retain the legibility of the column chart? --- class: inverse, center, middle # Represent each country with the same size ## And arrange the countries in 'roughly' the correct place geographically --- .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df <- read_rds("data/coffee_map_tiles.rds") df %>% ggplot(aes(xmin = x, ymin = y, xmax = x + 1, ymax = y + 1)) + geom_rect(color = "#ffffff") + geom_rect(color = "#ffffff", aes(fill = mean_country_rating)) + scale_fill_viridis_c(trans = "sqrt") + geom_text(aes(x = x, y = y, label = alpha.2), color = "#ffffff", alpha = 0.5, nudge_x = 0.5, nudge_y = 0.5, size = 2.8) + theme(panel.grid = element_blank()) + theme_void() + theme(legend.position = "bottom") + coord_equal() + labs(fill = "Average Arabica coffee rating by country") ``` ] ] --- ## What have we improved? .pull-left[ <!-- --> ] .pull-right[ - We can see the spatial correlation in Africa - Geographically small countries are not forgotten - Comparing countries is not difficult with the diverging colour scale ] --- # Time on a third axis Sometimes it is interesting to show an evolution of two variables over time. For example, say we have information on fertility rates and the share of children born outside of marriage in Europe. The data looks like this: ```r df <- read.csv("data/df_denmark_greece.csv") %>% as_tibble() df %>% head() ``` ``` ## # A tibble: 6 x 4 ## country year tfr pbom ## <chr> <int> <dbl> <dbl> ## 1 Denmark 1960 2.57 7.8 ## 2 Denmark 1961 2.55 8 ## 3 Denmark 1962 2.55 8.3 ## 4 Denmark 1963 2.67 8.9 ## 5 Denmark 1964 2.6 9.3 ## 6 Denmark 1965 2.61 9.5 ``` --- ## The obvious choice is a line chart .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% # gives meaningful variable names rename(`Total fertility rate` = tfr, `Proportion of births outside marriage` = pbom) %>% # makes it into a longer dataset so that we can facet # wrap by indicator pivot_longer(-c(country, year), names_to = "indicator") %>% ggplot(aes(year, value, colour = country)) + # here we say nrow = 2 so that they are above one another facet_wrap(~ indicator, nrow = 2, scales = "free_y") + geom_line() + # remove unnecessary axis labels labs(y = NULL, x = "Year", colour = "Country") + theme(legend.position = "right") ``` ] ] --- ### An alternative .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(year_flag = ifelse(test = year %% 7 == 0, yes = year, no = NA)) %>% ggplot(aes(tfr, pbom, colour = country, label = year_flag)) + geom_point() + geom_segment(aes(x= tfr, y= pbom, xend=c(tail(tfr, n=-1), NA), yend=c(tail(pbom, n=-1), NA)), arrow = arrow(length=unit(0.25,"cm"))) + geom_text_repel(colour = "black") + scale_y_continuous(labels = percent_format(scale = 1)) + scale_colour_manual(values = c("#C60C30", "#0D5EAF")) + labs(x = "Total fertility rate", y = "Proportion of births\noutside of marriage", title = "Fertility vs births outside of marriage in <span style='color:#C60C30'>Denmark</span> and <span style='color:#0D5EAF'>Greece</span>") + theme(legend.position = "none", plot.title = element_markdown(size = 18)) ``` ] ] --- ## What have we learned? .pull-left[ <!-- --> ] .pull-right[ - Both countries saw a large drop in fertility from the 1960s until the 1980s - In Denmark, after 1970 we see an increase in the share of children born outside of marriage - In contrast, Greek families have relatively few children outside of marriage. - After 1990, Danish fertility increased from 1.3 to 1.8, while Greek fertility remained at 'lowest-low' levels, below replacement. ] --- ## What have we changed? .pull-left[ <!-- --> ] .pull-right[ - Indicators on the x- and y-axis and then show time with text labels - Legend is replaced with colour coded title - Colours have meaning (main colour of country flag) - Percentage labels on the y-axis ] --- # Giving context with colour ## Sometimes we may want to show a particular series of data in its correct context. ### For instance, in our line graph above which showed the evoltuion of the share of births outside of marriage in **Denmark and Greece**, we might want to know if these two represent the **extremes** within Europe. --- ## Do **Denmark and Greece** represent the **extremes** of the share of children born outside of marriage in Europe? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df <- read_rds("data/births_outside_marriage.rds") df %>% filter(flag == 1) %>% ggplot(aes(x = year, y = pbom, group = country, colour = country)) + geom_line() + scale_y_continuous(labels = percent_format(scale = 1)) + labs(y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country") ``` ] ] --- ### One way to do this would be to show an average and then how these compare .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% * group_by(year) %>% * mutate(mean_pbom = mean(pbom, na.rm = T)) %>% * ungroup() %>% ggplot() + geom_line(aes(x = year, y = pbom, group = country, colour = country), data = df %>% filter(flag == 1)) + * geom_line(aes(x = year, y = mean_pbom, colour = "European average")) + scale_y_continuous(labels = percent_format(scale = 1)) + labs(y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country") ``` ] ] --- ### We can improve upon this by introducing an interval ribbon .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% group_by(year) %>% mutate(mean_pbom = mean(pbom, na.rm = T), * pct_10 = quantile(pbom, .1, na.rm = T), * pct_90 = quantile(pbom, .9, na.rm = T)) %>% ungroup() %>% ggplot() + geom_line(aes(x = year, y = pbom, group = country, colour = country), data = df %>% filter(flag == 1)) + * geom_ribbon(aes(x = year, ymin = pct_10, ymax = pct_90, * fill = "Interval \n(10th to 90th percentile)"), alpha = .3) + geom_line(aes(x = year, y = mean_pbom, colour = "European average")) + scale_y_continuous(labels = percent_format(scale = 1)) + * scale_fill_manual(values = "#90ee90") + * guides(fill = guide_legend(order = 2), * col = guide_legend(order = 1)) + labs(y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country", * fill = "") ``` ] ] --- ### An alternative shows the series for each country .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df %>% ggplot(aes(x = year, y = pbom, group = country, colour = country)) + geom_line() + scale_y_continuous(labels = percent_format(scale = 1)) + labs(y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country") + theme(legend.text = element_text(size = 4), axis.title = element_text(size = 8), legend.position = "bottom") ``` ] ] --- class: center, middle, inverse # This is silly --- class: center, middle, inverse ## The legend is gigantic ## There are far too many colours ## We cannot see Denmark or Greece --- ### We can **highlight** the counties we are interested in And gray out the other lines .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r library(gghighlight) df %>% ggplot() + geom_line(aes(x = year, y = pbom, colour = country)) + gghighlight(flag == 1) + scale_y_continuous(labels = percent_format(scale = 1)) + theme(legend.text = element_text(size = 4), legend.position = "bottom", plot.title = element_markdown(size = 14)) + scale_colour_manual(values = c("#C60C30", "#0D5EAF")) + labs(title = "Proportion of births outside of marriage in <span style='color:#C60C30'>Denmark</span> and <span style='color:#0D5EAF'>Greece</span>", y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country") ``` ] ] --- # Advantages of gghighlight .pull-left[ <!-- --> ] .pull-right[ - Shows each of the series - We can see that Denmark is a leader in the beginning, but is caught up by other nations - Does not hide outliers - Makes clear the trends in your countries of interest ] --- # Interactivity Publishing your research on the web is a good way to get it out there. Making your visualizations interactive is a fantastic way to have your readers explore the story you tell. --- .panelset[ .panel[.panel-name[Plot]
] .panel[.panel-name[Code] ```r library(leaflet) library(glue) df <- read_rds("data/baptism_data.rds") %>% group_by(parish_name) %>% mutate(n_obs = n()) %>% ungroup() df <- df %>% # this popup will contain the parsh name in bold (<b> ... </b> makes bold) # and the address in the next line (with a <br/> break) mutate(popup = glue("<b>{parish_name}</b><br/>Located at {parish_address},<br/>this parish has <b>{n_obs}</b> baptisms in our dataset.")) df %>% # makes a leaflet map leaflet() %>% # adds a background addTiles() %>% # makes the map grey in colour addProviderTiles(providers$CartoDB.Positron) %>% # sets the centre of the map with the mean values of lat and long setView(mean(df$long), mean(df$lat), zoom = 11) %>% # here we add a marker with the popup specified above, inclduing the name of the parish and its address. addMarkers(unique(df$long), unique(df$lat), popup = unique(df$popup)) ``` ] ] --- ## Interactive plots ### Have a look at [Claus Wilke's slides here](https://wilkelab.org/SDS375/slides/interactive-plots.html#1) for more info on Interactive plots
--- ### Recreating published figures  --- ### How do you recreate it without the underlying data? .panelset[ .panel[.panel-name[Plot] <!-- --> ] .panel[.panel-name[Code] ```r df <- read.csv("data/Fouquet_Broadberry.csv") %>% as_tibble() order <- df %>% group_by(country) %>% filter(x == max(x)) %>% arrange(desc(y)) %>% ungroup() %>% select(country) %>% as.list() df %>% mutate(country = fct_relevel(country, order)) %>% ggplot(aes(x, y, colour = country, lty = country)) + geom_point() + geom_line() + scale_y_continuous(labels = dollar_format()) + scale_color_brewer(palette = "Dark2") + labs(title = "GDP Per Capita in Selected European Countries", y = "GDP Per Capita (1990 USD)", x = NULL, colour = "Country", lty = "Country") ``` ] ] --- class: inverse, center, middle # 7. Return to your familiar situation --- ### Thank you  ---